25 research outputs found
Thompson sampling based Monte-Carlo planning in POMDPs
Monte-Carlo tree search (MCTS) has been drawing great interest in recent years for planning under uncertainty. One of the key challenges is the tradeoff between exploration and exploitation. To address this, we introduce a novel online planning algorithm for large POMDPs using Thompson sampling based MCTS that balances between cumulative and simple regrets. The proposed algorithm, Dirichlet-Dirichlet-NormalGamma based Partially Observable Monte-Carlo Planning (D2NG-POMCP), treats the accumulated reward of performing an action from a belief state in the MCTS search tree as a random variable following an unknown distribution with hidden parameters. A Bayesian method is used to model and infer the posterior distribution of these parameters by choosing the conjugate prior in the form of a combination of two Dirichlet and one NormalGamma distributions. Thompson sampling is exploited to guide action selection in the search tree. Experimental results confirm that our algorithm outperforms state-of-the-art approaches on several common benchmark problems.
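The action-selection step can be sketched with a simplified NormalGamma posterior per action. The prior hyperparameters (mu0=0, lambda0=1, alpha0=beta0=1) and the omission of the two Dirichlet components are assumptions of this toy, not the paper's exact construction:

```python
import numpy as np

def thompson_select(counts, sums, sum_sqs, rng):
    """Pick an action by sampling mean returns from NormalGamma posteriors.

    counts/sums/sum_sqs: per-action sufficient statistics of observed returns.
    A simplified stand-in for D2NG-POMCP's machinery: each action's return is
    modeled as Normal with an unknown mean and precision under a NormalGamma
    prior (mu0=0, lambda0=1, alpha0=1, beta0=1; assumed values).
    """
    mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0
    samples = []
    for n, s, ss in zip(counts, sums, sum_sqs):
        if n == 0:
            mean, lam, a, b = mu0, lam0, a0, b0
        else:
            xbar = s / n
            mean = (lam0 * mu0 + s) / (lam0 + n)   # posterior mean location
            lam = lam0 + n
            a = a0 + n / 2.0
            b = (b0 + 0.5 * (ss - n * xbar ** 2)
                 + lam0 * n * (xbar - mu0) ** 2 / (2 * (lam0 + n)))
        tau = rng.gamma(a, 1.0 / b)                      # sample precision
        mu = rng.normal(mean, 1.0 / np.sqrt(lam * tau))  # sample mean return
        samples.append(mu)
    return int(np.argmax(samples))                       # act greedily on samples

rng = np.random.default_rng(0)
# action 1 has clearly higher observed returns, so it should dominate
counts = [50, 50]
sums = [50 * 0.1, 50 * 0.9]
sum_sqs = [50 * (0.1 ** 2 + 0.01), 50 * (0.9 ** 2 + 0.01)]
picks = [thompson_select(counts, sums, sum_sqs, rng) for _ in range(200)]
print(sum(picks))  # mostly action 1
```

With little data the posteriors are wide and both actions get sampled (exploration); as evidence accumulates the sampled means concentrate and the better action wins (exploitation).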
Policy Regularization with Dataset Constraint for Offline Reinforcement Learning
We consider the problem of learning the best possible policy from a fixed
dataset, known as offline Reinforcement Learning (RL). A common category of
existing offline RL methods is policy regularization, which typically constrains
the learned policy to the distribution or support of the behavior policy. However,
distribution and support constraints are overly conservative, since both
force the policy to choose actions similar to the behavior policy's at
particular states. This limits the learned policy's performance,
especially when the behavior policy is sub-optimal. In this paper, we find that
regularizing the policy towards the nearest state-action pair can be more
effective and thus propose Policy Regularization with Dataset Constraint
(PRDC). When updating the policy in a given state, PRDC searches the entire
dataset for the nearest state-action sample and then restricts the policy with
the action of this sample. Unlike previous works, PRDC can guide the policy
with proper behaviors from the dataset, allowing it to choose actions that do
not appear in the dataset along with the given state. It is a softer constraint
but still keeps enough conservatism from out-of-distribution actions. Empirical
evidence and theoretical analysis show that PRDC can alleviate offline RL's
fundamentally challenging value overestimation issue with a bounded performance
gap. Moreover, on a set of locomotion and navigation tasks, PRDC achieves
state-of-the-art performance compared with existing methods. Code is available
at https://github.com/LAMDA-RL/PRDC
Comment: Accepted to ICML 202
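The dataset-constraint lookup described above can be sketched as follows; the `beta` state-weighting, the Euclidean nearest-neighbor search, and the function name are illustrative assumptions. In the full method, the returned action would enter a regularization term alongside the usual value-maximization objective:

```python
import numpy as np

def prdc_target_action(state, policy_action, data_states, data_actions, beta=2.0):
    """Dataset-constraint lookup in the spirit of PRDC (a sketch, not the
    authors' implementation).

    Concatenate the scaled query state with the current policy action, find
    the nearest (scaled state, action) pair in the dataset, and return that
    neighbor's action as the regularization target for the policy update.
    """
    query = np.concatenate([beta * state, policy_action])
    keys = np.concatenate([beta * data_states, data_actions], axis=1)
    idx = np.argmin(np.linalg.norm(keys - query, axis=1))  # nearest pair
    return data_actions[idx]

# toy dataset: the nearest (state, action) pair's action becomes the target,
# even though the exact action was never seen at the query state
data_states = np.array([[0.0, 0.0], [1.0, 1.0]])
data_actions = np.array([[-1.0], [1.0]])
target = prdc_target_action(np.array([0.9, 1.1]), np.array([0.5]),
                            data_states, data_actions)
print(target)  # [1.]
```

Because the search is over state-action pairs rather than the actions logged at one state, the policy can be guided toward actions that never co-occurred with the query state in the dataset, which is the "softer constraint" the abstract refers to.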
Efficient Deep Reinforcement Learning via Adaptive Policy Transfer
Transfer Learning (TL) has shown great potential to accelerate Reinforcement
Learning (RL) by leveraging prior knowledge from past learned policies of
relevant tasks. Existing transfer approaches either explicitly compute the
similarity between tasks or select appropriate source policies to provide
guided exploration for the target task. However, how to directly optimize the
target policy by selectively utilizing knowledge from appropriate source
policies, without explicitly measuring the similarity, is currently missing. In
this paper, we propose a novel Policy Transfer Framework (PTF) to accelerate RL
by taking advantage of this idea. Our framework learns when and which source
policy is the best to reuse for the target policy and when to terminate it by
modeling multi-policy transfer as the option learning problem. PTF can be
easily combined with existing deep RL approaches. Experimental results show it
significantly accelerates the learning process and surpasses state-of-the-art
policy transfer methods in terms of learning efficiency and final performance
in both discrete and continuous action spaces.
Comment: Accepted by IJCAI'202
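One way to picture the "when and which" decisions is an option-style control loop; the softmax selection over option values and the fixed per-option termination probabilities below are placeholder assumptions, not PTF's learned option-value and termination functions:

```python
import numpy as np

def run_with_options(option_values, terminations, steps, rng):
    """Option-style source-policy reuse (a sketch of the idea, not PTF itself).

    option_values: estimated value of reusing each source policy (the
    'which' decision); terminations: per-option termination probabilities
    (the 'when to stop' decision). Returns the sequence of active options.
    """
    probs = np.exp(option_values) / np.sum(np.exp(option_values))
    trace = []
    active = rng.choice(len(option_values), p=probs)   # pick a source policy
    for _ in range(steps):
        trace.append(int(active))                      # follow it this step
        if rng.random() < terminations[active]:        # option terminates
            active = rng.choice(len(option_values), p=probs)  # re-select
    return trace

rng = np.random.default_rng(1)
# option 1 is valued higher and persists longer (low termination probability)
trace = run_with_options([0.2, 0.8], [0.5, 0.1], steps=20, rng=rng)
print(trace)
```

In PTF both the option values and the termination condition are learned, so the agent gradually discovers which source policy helps and how long to commit to it before re-deciding.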
Retrosynthetic Planning with Dual Value Networks
Retrosynthesis, which aims to find a route to synthesize a target molecule
from commercially available starting materials, is a critical task in drug
discovery and materials design. Recently, the combination of ML-based
single-step reaction predictors with multi-step planners has led to promising
results. However, the single-step predictors are mostly trained offline to
optimize the single-step accuracy, without considering complete routes. Here,
we leverage reinforcement learning (RL) to improve the single-step predictor,
by using a tree-shaped MDP to optimize complete routes. Specifically, we
propose a novel online training algorithm, called Planning with Dual Value
Networks (PDVN), which alternates between the planning phase and updating
phase. In PDVN, we construct two separate value networks to predict the
synthesizability and cost of molecules, respectively. To maintain the
single-step accuracy, we design a two-branch network structure for the
single-step predictor. On the widely-used USPTO dataset, our PDVN algorithm
improves the search success rate of existing multi-step planners (e.g.,
increasing the success rate from 85.79% to 98.95% for Retro*, and reducing the
number of model calls by half while solving 99.47% molecules for RetroGraph).
Additionally, PDVN helps find shorter synthesis routes (e.g., reducing the
average route length from 5.76 to 4.83 for Retro*, and from 5.63 to 4.78 for
RetroGraph).
Comment: Accepted to ICML 202
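The role of the two value networks can be illustrated with a toy route evaluator; the recursion over a tree-shaped route, the unit reaction cost, and the dictionaries standing in for the networks are all assumptions of this sketch:

```python
def route_value(node, syn, cost):
    """Aggregate dual values over a retrosynthesis route (illustrative only).

    A route node is either a purchasable leaf molecule (a string) or a
    reaction (a list of child nodes). Mirroring the abstract's two targets:
    route synthesizability multiplies over reactants, route cost sums.
    `syn` and `cost` map leaf molecules to the two value estimates (toy
    dictionaries standing in for the two value networks).
    """
    if isinstance(node, str):              # leaf: purchasable molecule
        return syn[node], cost[node]
    s_total, c_total = 1.0, 1.0            # assumed unit cost per reaction step
    for child in node:
        s, c = route_value(child, syn, cost)
        s_total *= s                       # all reactants must be synthesizable
        c_total += c                       # costs accumulate along the route
    return s_total, c_total

syn = {"A": 0.9, "B": 0.8}
cost = {"A": 2.0, "B": 3.0}
print(route_value(["A", "B"], syn, cost))  # synthesizability 0.9*0.8, cost 1+2+3
```

Separating the two signals lets the planner first ensure a route is feasible at all (synthesizability) and then prefer cheaper, shorter routes (cost), which matches the reported reduction in average route length.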
Expert Data Augmentation in Imitation Learning (Student Abstract)
Behavioral Cloning (BC) is a simple and effective imitation learning algorithm, but it suffers from compounding error due to covariate shift. One solution is to use enough data for training; however, the amount of expert demonstrations available is usually limited. We therefore propose an effective method to augment expert demonstrations and alleviate the compounding-error problem in BC. It operates by estimating the similarity of states and, during sampling, keeping transitions that lead back to states similar to those in the expert demonstrations. The transitions selected in this way, together with the original expert demonstrations, are used for training. We evaluate our method on several Atari tasks and continuous MuJoCo control tasks. Empirically, BC trained with the augmented data significantly outperforms BC trained with the original expert demonstrations alone.
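The filtering step might look roughly like this; the Euclidean similarity measure and the fixed threshold are assumptions, since the abstract does not specify how state similarity is estimated:

```python
import numpy as np

def filter_augmenting_transitions(sampled, expert_states, threshold=0.5):
    """Keep sampled transitions that return toward expert-visited states
    (a sketch of the abstract's similarity-based filtering).

    sampled: list of (state, action, next_state) tuples. A transition is
    kept if its next_state lies within `threshold` (assumed Euclidean
    distance) of some expert state, i.e. it leads back to the kind of
    states the expert demonstrations cover.
    """
    kept = []
    for s, a, s_next in sampled:
        dists = np.linalg.norm(expert_states - s_next, axis=1)
        if dists.min() <= threshold:
            kept.append((s, a, s_next))
    return kept

expert_states = np.array([[0.0, 0.0], [1.0, 0.0]])
sampled = [
    (np.array([2.0, 2.0]), 0, np.array([0.9, 0.1])),  # returns near expert: kept
    (np.array([2.0, 2.0]), 1, np.array([3.0, 3.0])),  # drifts away: dropped
]
print(len(filter_augmenting_transitions(sampled, expert_states)))  # 1
```

Training BC on the kept transitions teaches the cloned policy recovery behavior: what to do in off-distribution states so that it steers back into the expert's state distribution, which is exactly where compounding error originates.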
Policy-Independent Behavioral Metric-Based Representation for Deep Reinforcement Learning
Behavioral metrics calculate the distance between states or state-action pairs from differences in rewards and transitions. Because they can, in theory, filter out task-irrelevant information, using them to shape a state embedding space has become a new trend in representation learning for deep reinforcement learning (RL), especially when there are explicit distracting factors in observation backgrounds. However, due to the tight coupling between the metric and the RL policy, such metric-based methods may result in less informative embedding spaces, which can weaken their aid to the baseline RL algorithm and even require more samples to learn. We resolve this by proposing a new behavioral metric that decouples the learning of the RL policy and the metric, owing to its independence from the RL policy. We theoretically justify its scalability to continuous state and action spaces and design a practical way to incorporate it into an RL procedure as a representation-learning target. We evaluate our approach on DeepMind control tasks with default and distracting backgrounds. Under statistically reliable evaluation protocols, our experiments demonstrate that our approach is superior to previous metric-based methods in terms of sample efficiency and asymptotic performance in both settings.
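A behavioral metric of the kind this line of work builds on can be sketched as a reward-difference term plus a discounted transition-difference term; the sample-based next-state distance below is an assumption standing in for the paper's actual policy-independent construction:

```python
import numpy as np

def behavioral_distance(r_i, r_j, next_i, next_j, gamma=0.99):
    """Bisimulation-style behavioral distance between two states (sketch).

    |reward difference| + gamma * distance between sampled next states.
    The paper's metric is policy-independent; here both terms are computed
    from logged transitions rather than a policy's action distribution,
    which is the simplifying assumption this toy makes.
    """
    reward_term = abs(r_i - r_j)                           # immediate behavior
    transition_term = float(np.linalg.norm(next_i - next_j))  # future behavior
    return reward_term + gamma * transition_term

# states with identical rewards and dynamics are 0 apart, so task-irrelevant
# background pixels (which change neither term) do not separate states
d = behavioral_distance(1.0, 0.5, np.array([0.0, 0.0]), np.array([0.3, 0.4]))
print(d)  # 0.5 + 0.99 * 0.5
```

Using such a distance as a representation-learning target pulls embeddings of behaviorally similar states together regardless of visual distractors, which is why these methods help under distracting backgrounds.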